We present a quantitative measure of physical complexity, based on the amount of information 
required to build a given physical structure through self-assembly. Our procedure can be adapted 
to any given geometry, and thus to any given type of physical system. We illustrate our approach 
using self-assembling polyominoes, and demonstrate the breadth of its potential applications by 
quantifying the physical complexity of molecules and protein complexes. This measure is particularly 
well suited for the detection of symmetry and modularity in the underlying structure, and allows for 
a quantitative definition of structural modularity. Furthermore we use our approach to show that 
symmetric and modular structures are favoured in biological self-assembly, for example of protein 
complexes. Lastly, we also introduce the notions of joint, mutual and conditional complexity, which 
provide a useful distance measure between physical structures. 



ALGORITHMIC COMPLEXITY 

More than forty years ago, Kolmogorov [1] and Chaitin [2] laid the foundations of algorithmic
information theory by introducing the concept of algorithmic information content, or Kolmogorov
complexity, for a given string of information [3]. This measure of complexity is defined as the
length of the shortest possible program on a universal computer that will output the string in
question. Here we propose a conceptually analogous measure of the complexity of any connected
physical structure. Instead of a universal computer which translates a program into a string of
information, we consider a general framework of self-assembly rules, which act together to create
a physical object. The 'program' is now our set of self-assembly building blocks and rules, the
'computer' is given by the physical interactions of the self-assembling building blocks, and the
'output' is the final structure. Using this approach we investigate the physical complexity of
shapes in two and three dimensions, including polyominoes, molecules and protein complexes. Our
work generalizes ideas first explored in [4, 5], and opens them up to a wide range of
applications. Furthermore, in the context of protein complexes it offers the kind of biological
application of information-theoretic concepts demanded in [6].



SELF-ASSEMBLY KIT 

There are many examples of self-assembling structures in physics, chemistry and biology.
Examples include thin films [8], micelles [9], viruses [10, 11] and DNA [12-17]. Our aim is to
introduce a general framework for the theoretical study of self-assembling structures. This
framework can be used to study the properties of real self-assembling systems, but, more
generally, it can also be used to measure the physical complexity of any construct,
self-assembling or not. The exact nature of the self-assembly framework depends on the
underlying physical system, but it always contains two basic ingredients: a set of building
blocks and a set of rules. We shall call this combination an assembly kit $S$. Each building
block $i$ has $f_i$ interfaces, which typically are subject to geometric constraints (depending
on the physical system). Attached to each interface $j$ of a given building block $i$ is an
integer $\chi_{ij} \in [1, \dots, c]$. The $c$ possible values of these integers are the colours
of these interfaces. The number of distinct colourings of the building blocks depends entirely
on the geometry of the problem. The second ingredient of the assembly kit is the set of rules,
which takes the form of an interaction matrix between colours. In the simplest case this matrix
is binary, where 1 signifies attraction and 0 signifies no interaction at all. Many more
sophisticated interaction matrices involving repulsion and a continuous spectrum of energies are
easily imaginable.
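To make the two ingredients concrete, the following is a minimal sketch of an assembly kit as a data structure. The class names `BuildingBlock` and `AssemblyKit`, the method `attracts`, and the two example blocks are illustrative assumptions and are not taken from the text; only the ideas of integer colours on interfaces and a binary interaction matrix come from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BuildingBlock:
    """One building block type: the colour (an integer in 1..c) on each interface."""
    colours: Tuple[int, ...]

    @property
    def n_interfaces(self) -> int:
        # f_i in the notation of the text
        return len(self.colours)


@dataclass
class AssemblyKit:
    """A set of building block types plus a binary c x c interaction matrix."""
    blocks: List[BuildingBlock]
    interaction: List[List[int]]  # 1 = attraction, 0 = no interaction

    def attracts(self, colour_a: int, colour_b: int) -> bool:
        # Colours are 1-based in the text, list indices are 0-based.
        return self.interaction[colour_a - 1][colour_b - 1] == 1


# Hypothetical kit: colours A=1 and B=2 attract each other, C=3 attracts nothing.
kit = AssemblyKit(
    blocks=[BuildingBlock((1, 1, 1, 1)), BuildingBlock((2, 3, 3, 3))],
    interaction=[[0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]],
)
```

More general rule sets (repulsion, continuous energies) would replace the 0/1 entries with real numbers; the sketch deliberately keeps the simplest binary case.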

For any system of self-assembling particles we also need to specify a model for the actual
assembly process. A convenient choice is a model assuming a single nucleus in solution [4], in
which each disjoint object has one fixed nucleus building block, surrounded by a solution
containing a freely moving population of many copies of each type of building block. At each
time step (i) a fixed building block, (ii) a site adjacent to it, (iii) a rotational
orientation, and (iv) a building block from the solution are chosen at random, and the new,
randomly rotated building block becomes fixed in its position if the rules allow it. Note that
some assembly kits always assemble into the same shape - these we call 'deterministic' - while
ones which contain ambiguous rules are 'non-deterministic'. See Figure 1 for an example of a
deterministic and a non-deterministic self-assembly kit.

FIG. 1: An example of deterministic and non-deterministic self-assembly kits, using simple 2D
lattice structures (polyominoes). In both cases, colours A and B attract each other, but C
attracts neither A nor B. No colour attracts itself. The kit on the left will always assemble
into the cross shape while that on the right will assemble into an irregular cluster, as there
are several ways in which the two blocks can attach.
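The single-nucleus process described above can be sketched for square tiles as follows. This is an illustrative simplification and not the authors' implementation: the function names `rotate` and `assemble`, the choice of the first block type as nucleus, and the rule that only the interface between the chosen fixed block and the incoming block is checked are all assumptions made for the sketch.

```python
import random

# Offsets of the four sides of a square tile in clockwise order: N, E, S, W.
SIDES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def rotate(colours, r):
    """Rotate a 4-tuple of side colours clockwise by r quarter turns."""
    return tuple(colours[(k - r) % 4] for k in range(4))


def assemble(block_types, attracts, steps, seed=0):
    """Grow a structure from a single nucleus; returns {lattice site: side colours}."""
    rng = random.Random(seed)
    placed = {(0, 0): block_types[0]}              # assumed nucleus: first block type
    for _ in range(steps):
        pos = rng.choice(sorted(placed))           # (i) a fixed building block
        side = rng.randrange(4)                    # (ii) a site adjacent to it
        r = rng.randrange(4)                       # (iii) a rotational orientation
        new = rotate(rng.choice(block_types), r)   # (iv) a block from the solution
        site = (pos[0] + SIDES[side][0], pos[1] + SIDES[side][1])
        if site in placed:
            continue                               # the site is already occupied
        facing_out = placed[pos][side]             # colour on the fixed block
        facing_in = new[(side + 2) % 4]            # opposite side of the new block
        if attracts(facing_out, facing_in):
            placed[site] = new                     # the rules allow it: fix the block
    return placed
```

With a hypothetical kit consisting of a centre block coloured A on all four sides and an arm block coloured B on one side and C elsewhere, where only A and B attract, repeated runs of this sketch always end in the same five-tile cross, which is the behaviour the text calls deterministic.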

As a simple example of a self-assembling system, we will consider self-assembling polyominoes.
A polyomino (also known as a lattice animal) is a set of connected sites on a (typically
square) lattice [18]. These connected sites are our self-assembly building blocks. Every
building block has four sides (so that $f_i = 4$ for all $i$), which are painted with one of
$c$ colours. These colours can attract each other or not, as encoded in a $c \times c$ binary
interaction matrix. Each distinct way of colouring a building block corresponds to a different
building block type. We do not regard rotated colourings as distinct. The geometry of the 2D
lattice gives rise to a particular set of building block colourings in the context of
self-assembly. If we have $c$ colours, the total number of such colourings is

$$\frac{1}{4}\left(c^4 + c^2 + 2c\right).$$

These particular colourings are also known as necklaces, which can be defined as equivalence
classes of strings under rotation [19]. The definition of necklaces used here assumes that the
building blocks have a fixed chirality - in other words that the necklaces which the colours
form on the building blocks are fixed [?].
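The number of rotation-distinct colourings of a four-sided block can be checked by brute force. The sketch below enumerates all colourings, keeps one representative per rotation class, and compares the count with the standard Burnside result for four-bead necklaces of fixed chirality, $(c^4 + c^2 + 2c)/4$. The function names are illustrative assumptions.

```python
from itertools import product


def count_necklaces(c, n=4):
    """Count colourings of n sides with c colours, up to rotation (no reflections)."""
    representatives = set()
    for colouring in product(range(c), repeat=n):
        rotations = [colouring[r:] + colouring[:r] for r in range(n)]
        representatives.add(min(rotations))   # canonical member of the rotation class
    return len(representatives)


def necklaces_closed_form(c):
    """Burnside count for the four rotations of a square tile."""
    return (c**4 + c**2 + 2 * c) // 4
```

For example, two colours give 6 distinct square tiles and three colours give 24.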

THE MINIMUM KIT 

Every deterministic assembly kit $S_A$, which always assembles into a structure $A$, requires a
certain amount of information $I(S_A)$ to describe it in some given language $L$. Our aim is to
minimize this quantity, as we define the length of the description of the minimum assembly kit
$\tilde{S}_A$ as the complexity $K(A)$ of structure $A$:

$$K(A) = I(\tilde{S}_A) = \min_{S_A} I(S_A)$$

in analogy to the concept of Kolmogorov complexity. Any symmetry or modularity which the
structure $A$ contains decreases the amount of information required to describe the structure
and will therefore be reflected in its minimum assembly kit $\tilde{S}_A$, and by extension in
the value of $K(A)$.

If a minimum assembly kit is deterministic, an interaction matrix $A$ (with elements $a_{ij}$)
between a total of $c$ colours, of which $c_s$ self-interact, can be rewritten as:

$$a_{ij} = [1 - (i \bmod 2)]\,\delta_{j(i+1)} + (i \bmod 2)\,\delta_{j(i-1)}$$

for $i < c - c_s$, and $a_{ij} = \delta_{ij}$ otherwise, so that one colour always only
interacts with one other colour. With this constraint, the amount of information, in bits,
required to describe a self-assembly kit $S_A$ with $b$ building block types, is:

$$I(S_A) = \log_2(c_s + 1) + \sum_{i=1}^{b} \left[ c_i \log_2 c + \log_2 F_i \right] \qquad (1)$$

The first term relates to the number of self-interacting colours, the second measures the
information required to describe which $c_i$ colours out of the total of $c$ colours appear on
building block $i$, and the third term $\log_2 F_i$ measures the information describing the
distinct arrangement of the $c_i$ colours on the $f_i$ faces of building block $i$. For a
general building block with $f_i$ labelled faces, $F_i$ takes the form of:

$$F_i(c_i, f_i) = \sum_{k_1} \sum_{k_2} \cdots \sum_{k_{c_i}} \frac{f_i!}{\prod_{j=1}^{c_i} k_j!}$$

where $\sum_{j=1}^{c_i} k_j = f_i$ and the $k_j \geq 1$ signify the number of times colour $j$
occurs on block $i$.
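As a hedged illustration, the sketch below builds the paired interaction matrix and evaluates the kit information, reading equation (1) as $I = \log_2(c_s + 1) + \sum_i [c_i \log_2 c + \log_2 F_i]$. Colours are indexed from 0 so that the pairs are (0,1), (2,3), and so on, followed by the $c_s$ self-interacting colours; this indexing, the function names, and the representation of each block as a pair `(c_i, F_i)` are assumptions made for the example.

```python
import math


def paired_interaction_matrix(c, c_s):
    """Binary matrix in which each colour interacts with exactly one colour.

    The first c - c_s colours are paired up (0-1, 2-3, ...); the last c_s
    colours interact only with themselves.
    """
    if (c - c_s) % 2 != 0:
        raise ValueError("the number of non-self-interacting colours must be even")
    a = [[0] * c for _ in range(c)]
    for i in range(c):
        if i < c - c_s:
            j = i + 1 if i % 2 == 0 else i - 1   # partner colour within the pair
        else:
            j = i                                # self-interacting colour
        a[i][j] = 1
    return a


def kit_information(c, c_s, blocks):
    """Information in bits for a kit; blocks is an iterable of (c_i, F_i) pairs."""
    per_block = sum(c_i * math.log2(c) + math.log2(F_i) for c_i, F_i in blocks)
    return math.log2(c_s + 1) + per_block
```

For instance, a single block carrying 2 of 9 colours with $F_i = 4$ possible arrangements contributes $2\log_2 9 + 2 \approx 8.3$ bits.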

For polyominoes $F_i = F(c_i) = N_{c_i}$, where $N_{c_i}$ is the number of necklaces with
exactly $c_i$ colours, given by

$$N_{c_i} = \frac{1}{4}\left(c_i^4 + c_i^2 + 2c_i\right) - \sum_{k=1}^{c_i - 1} \binom{c_i}{k} N_k$$

with $N_1 = 1$. It follows that $N_2 = 4$, $N_3 = 9$, and $N_4 = 6$. As before, the complexity
$K(A)$ of polyomino $A$ is the minimum of $I(S_A)$ over all possible assembly kits $S_A$. Note
that Wang tiles [20] are a special case of self-assembling polyominoes. The tile system
described in [?] is also similar to our framework for the case of polyominoes, but (like Wang
tiles) only considers self-interacting colours, and treats rotated tiles as distinct. As a
result our encoding, based on necklaces, makes symmetry and modularity in the structure more
directly measurable.
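The values 1, 4, 9 and 6 for necklaces with exactly one to four colours can be reproduced independently. The sketch below counts them by brute force and, as a cross-check, by subtracting from the total rotation-distinct count all necklaces that use fewer colours. Both function names are illustrative assumptions.

```python
from itertools import product
from math import comb


def necklaces_exact_bruteforce(k, n=4):
    """Rotation-distinct colourings of n sides that use all k given colours."""
    representatives = set()
    for colouring in product(range(k), repeat=n):
        if len(set(colouring)) != k:
            continue                      # skip colourings that miss a colour
        representatives.add(min(colouring[r:] + colouring[:r] for r in range(n)))
    return len(representatives)


def necklaces_exact_recursive(k):
    """Same count for square tiles, via inclusion of all smaller colour subsets."""
    total = (k**4 + k**2 + 2 * k) // 4    # necklaces using at most k colours
    return total - sum(comb(k, j) * necklaces_exact_recursive(j) for j in range(1, k))
```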

If the faces are geometrically unconstrained - as one would imagine for a node with a set of
freely moving links - and hence unlabelled, we would only need to specify how much there is of
each colour. This can be written using $F_i = \prod_j k_j$, so that
$\log_2 F_i = \sum_j \log_2 k_j$. However, this only works under the condition that multiple
connections between the same pair of building blocks are prohibited.
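For unlabelled faces the arrangement term reduces to a sum of logarithms of the colour multiplicities, reading the expression above as $\log_2 F_i = \sum_j \log_2 k_j$. A minimal sketch follows; the function name and the list-of-counts input are assumptions.

```python
import math


def log2_F_unlabelled(counts):
    """Arrangement information in bits for a block with unlabelled faces.

    counts[j] is k_j, the number of interfaces of the block carrying colour j.
    """
    return sum(math.log2(k) for k in counts)
```

A block on which every colour appears exactly once therefore carries no arrangement information at all.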

The general algorithm we use to find the minimum 
assembly kit and thus the complexity for polyomi- 
noes and other structures is described in the following 
section. 



A GENERAL MEASURE OF STRUCTURAL 
COMPLEXITY 

Below we describe a general algorithm for minimizing 
the assembly kit size for a connected physical structure 
without relying on steric effects. Taking these into ac- 
count can minimize the assembly kit even further, but 
their computation is highly dependent on the geometry of 
the system and in most cases non-trivial (see Discussion). 
Note also that in some structures, such as polyominoes, 
some edges of the contact graph can be redundant in the 
context of the assembly process. Whether contact graph 
edges in general can be redundant or not depends on the 
nature of the structure and the assumptions connected 
to the self-assembly of that structure (see Discussion). 
Similarly, when interfaces are defined by geometry, as for the four sides of a polyomino
building block, it makes sense to introduce a neutral colour ($\nu = 1$ below). In systems with
a varying number of interfaces on the building blocks, neutral colours are usually not required
($\nu = 0$). To minimize the assembly kit we take the following steps:

1. Divide the structure into building blocks (usually
a natural division). The number of building blocks
is the size of the structure, denoted $z$.

2. Determine the equivalence of these units in terms
of any additional criteria (e.g. types of atoms, proteins).
This categorization is the species of building block.

3. Establish a contact graph $a_{ij}$ for the units (in some
cases, such as molecules, this may require setting a
distance cutoff).

4. If edges can be redundant: Consider the space of all
spanning subgraphs of this graph.

5. For the contact graph (in the case of no redundant
edges) or each subgraph (if redundant edges exist):



(a) Classify the (sub)graph according to the number of connections and (depending on the
geometry) the arrangement of connections.

(b) Label all nodes which are not yet labelled and which have exactly one unlabelled node
among their neighbours. The new labels distinguish nodes according to their species as
well as the topologically distinct label distributions among their neighbours.

(c) Repeat step 5b until all nodes are labelled or no more nodes can be labelled.

(d) All labelled nodes we define as category 1 nodes and any remaining unlabelled nodes
(i.e. nodes with at least two unlabelled neighbours) are defined as category 2 nodes.

(e) Label all category 2 nodes simultaneously according to their neighbourhoods.

(f) Repeat step 5e using the previous labellings to distinguish neighbourhoods, until
labellings are stable.

(g) These final labels, for nodes in both categories, denote the building block types. The
number of final labels, or types, is $b$. These can be subdivided into $b_1$ category 1
building block types and $b_2$ category 2 building block types. The category 2 type of
block $i$ is denoted $t_i$.

(h) The degree of each building block type $i$ in the contact graph (or subgraph) is the
number of its interfaces $f_i$.

(i) The total number of colours, including $\nu \in \{0, 1\}$ neutral colours, is

$$c = 2(b_1 - 1) + \nu + \sum_{i,j=1}^{b_2} \left(1 - \prod_{k,l=1}^{z} \left(1 - a_{kl}\,\delta_{t_k i}\,\delta_{t_l j}\right)\right).$$

The sum expression gives the number of different types of interfaces which occur between
category 2 building block types [4]. The number of colours $c_i$ on building block $i$ is
equal to the number of building block types in its contact graph neighbour set.

(j) Using $b$, $c$, $\{f_i\}$ and $\{c_i\}$ in equation (1), calculate the information $I$
required to specify this assembly kit, and thus the complexity $K$ of the structure.

6. If edges can be redundant: Minimize this quantity over all spanning subgraphs.

Figure 2 illustrates the crucial steps 5b to 5j for a polyomino. Figure 3 illustrates how the
complexity value $K$ reflects symmetry and modularity present in the structure.
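The labelling steps 5b to 5g can be sketched for a geometry-free contact graph as follows. This is an illustrative reading rather than the authors' code, and it rests on several assumptions: faces are treated as unlabelled, so only the multiset of neighbour labels matters; a node is labelled in the first phase when it has at most one unlabelled neighbour (rather than exactly one), so that the last node of a tree is also labelled; and the name `label_building_blocks` and its input format are hypothetical. The number of colours and the final information value are not computed here.

```python
def label_building_blocks(adj, species=None):
    """Assign building block types to the nodes of a contact graph.

    adj: dict mapping each node to the set of its neighbours.
    species: optional dict mapping each node to a hashable species tag.
    Returns (types, category): node -> integer type id, node -> 1 or 2.
    """
    if species is None:
        species = {n: 0 for n in adj}
    types, category, table = {}, {}, {}

    def intern(signature):
        # Map each distinct signature to a small integer type id.
        return table.setdefault(signature, len(table))

    # Steps 5b-5c: repeatedly label nodes with at most one unlabelled neighbour.
    while True:
        ready = [n for n in adj if n not in types
                 and sum(m not in types for m in adj[n]) <= 1]
        if not ready:
            break
        new = {n: intern((1, species[n],
                          tuple(sorted(types[m] for m in adj[n] if m in types))))
               for n in ready}
        types.update(new)                      # labels are applied simultaneously
        category.update((n, 1) for n in ready)

    # Steps 5d-5f: remaining nodes are category 2; refine their labels until stable.
    rest = [n for n in adj if n not in types]
    if rest:
        first = {}
        cur = {n: first.setdefault(
                   (species[n], tuple(sorted(types[m] for m in adj[n] if m in types))),
                   len(first))
               for n in rest}
        while True:
            seen = {}
            nxt = {n: seen.setdefault(
                       (cur[n], tuple(sorted(cur[m] for m in adj[n] if m in cur))),
                       len(seen))
                   for n in rest}
            if len(seen) == len(set(cur.values())):
                break                          # no class was split: labelling is stable
            cur = nxt
        for n in rest:
            types[n] = intern((2, cur[n]))
            category[n] = 2
    return types, category
```

On a ring of six identical units every node ends up in category 2 with a single type, whereas a four-node chain is entirely category 1 with two types (ends and middle). The second phase is a standard colour-refinement loop, chosen here because the text asks for simultaneous relabelling by neighbourhood until the labels stop changing.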
FIG. 2: An illustration of the crucial steps 5b to 5j of the algorithm
for minimizing the assembly kit size, in this case for a 
polyomino. In every iteration of category 1 labellings (LEFT), 
all unlabelled nodes with exactly one unlabelled neighbour are 
given labels which distinguish them according to their topo- 
logically distinct neighbourhoods of unlabelled and labelled 
tiles. This procedure is repeated until no more blocks can be 
labelled in this way. The remaining blocks are given category 
2 labellings (RIGHT) which are applied simultaneously, with 
each label distinguishing the topological neighbourhoods of 
the tiles in the previous iteration. Note that in the last itera- 
tion the labellings have stabilized, and only the interfaces of 
the building block types are updated. For structures in which 
edges can be redundant, this operation can be performed for 
all spanning subgraphs of the structure's connectivity graph, 
which further reduces the complexity. (In polyominoes, edges 
can be redundant, but there are no spanning subgraphs in the 
above example.) 



APPLICATIONS 

The self-assembly approach can be used to calculate 
complexity values for any physical structure. In order 
to demonstrate the broad range of potential applications 
we determine the complexity of (a) molecules and (b) 
protein complexes. 

The problem of molecular complexity has been studied extensively over the past seventy years,
starting with work by Polya [21] and Rashevsky among others [22, 23], and culminating in a
seminal paper by Bertz [24]. These approaches are based on Shannon entropy rather than
algorithmic information theory and focus on symmetries rather than the more general concept of
modularity. In molecules, we take atoms to be the building blocks and chemical bonds to be
their interfaces. Simple molecules, such as those in Figure 4, for which we are only interested
in the bond connectivity, are an example of a structure in which none of the edges can be
regarded as redundant. This is because, unlike for polyominoes, we are not assuming any
inherent geometry for the building blocks. If two atoms play the same self-assembly role but
represent atoms of different atomic species, they must be differentiated. This also goes for
atoms connected by different bond types. For example, in glutamine (see Figure 4), the oxygen
atom connected with a double bond is a leaf of the self-assembly tree just like any of the
(implicit) hydrogen atoms, but it requires a separate building block. The two molecules in our
example of Figure 4 are the amino acid glutamine and the explosive nitroglycerine, which both
consist of 20 atoms. Nitroglycerine however exhibits a much higher degree of modularity, with
its three NO3 groups, and therefore has a much lower complexity of $K = 55.3$ bits than
glutamine, for which the value is $K = 94.7$ bits. Note that nitroglycerine does not exhibit
simple three-fold symmetry, but a more subtle, hierarchical modularity. Such structural
features would be harder to discover using traditional approaches to the measurement of
molecular complexity [21-24], which do not take a self-assembly perspective and rely on Shannon
entropy rather than Kolmogorov complexity as a measure of complexity.

FIG. 3: The complexity values of these four polyomino shapes illustrate why the self-assembly
approach is an effective way of measuring symmetry and modularity without requiring prior
assumptions. If two shapes are of equal size, the one with more symmetry and modularity has a
lower complexity value - compare A with B, and C with D. If on the other hand, two shapes are
of similar complexity, but of different size, the larger one will be more symmetric or modular
(compare B and C).

FIG. 4: Measuring the complexity of molecules - The explosive nitroglycerine (top) and the
amino acid glutamine (bottom) both consist of 20 atoms, but differ greatly in complexity. The
highly modular structure of nitroglycerine with its three NO3 groups means that its complexity
value, at 52.2 bits, is little more than half that of glutamine ($K = 91.0$ bits). Note that
nitroglycerine does not have simple three-fold symmetry, but a more subtle modular structure,
which the self-assembly approach fully reveals. Note that we do not consider neutral colours in
this structure ($\nu = 0$).

Many important biochemical structures are protein complexes, consisting of several individually
formed and folded protein subunits bound together to produce functional cellular machinery.
These subunits may include different types of protein and several copies of the same protein.
The physical structure of protein complexes, as with proteins themselves, is important in
determining the functionality of the complex. The manner in which the subunits bond to form the
final complex is known as the quaternary structure of the complex. The 3DComplex database [25]
contains a description of the quaternary structures of thousands of protein complexes, in terms
of subunit type and inter-subunit bonding. If we have two proteins which play the same role in
the self-assembling structure but are different proteins, we can choose to count them as two
different building blocks (analogous to the aforementioned distinction between atomic species
in molecules). In the following analyses we are only interested in the connectivity of proteins
(equivalent to the QS Topology level in the 3DComplex database), and therefore do not
distinguish between different proteins. The two protein complexes in our example of Figure 5
are a chaperonin complex (E. coli chaperonin GroEL; PDB identifier: 1oel) and an allergen
complex (P. pratense allergen PHL P 6; PDB identifier: 1nlx). Both consist of 14 proteins, but
the former displays a much higher degree of symmetry and a much lower complexity value of
$K = 31.5$ bits, versus $K = 50.2$ bits for the allergen (which is still somewhat modular).

FIG. 5: We measure the complexity of two protein complexes, with PDB identifiers 1oel (a
chaperonin, top) and 1nlx (an allergen, bottom), which have 14 proteins each. The symmetry of
the chaperonin complex means that it has a much lower complexity value of $K = 31.5$ bits,
compared to $K = 50.2$ bits for the allergen complex. Note that we are assuming non-redundant
edges in this calculation, so that all building blocks of the chaperonin complex are category 2
and all building blocks of the allergen complex are category 1. Furthermore we do not consider
neutral colours ($\nu = 0$), and in the case of the chaperonin complex we have three
self-interacting colours ($c_s = 3$). Note also that both complexes are homomers, i.e. they
only have one type of subunit.

More complex protein structures require more unique inter-subunit bond types, compared to less
complex structures which can re-use bonds and be constructed through simple repetition of
subunits. As an increase in bond types corresponds biologically to the presence of more unique
bonding sites on subunit proteins, more complex protein structures can be thought of as
requiring more evolutionary innovation to produce and would therefore be expected to occur less
frequently in biological organisms [26]. This hypothesis is confirmed by Figure 6, which shows
a histogram of complexity values - normalized by the size of the protein complex, to avoid size
effects - for the 15733 protein complexes in the 3DComplex database [25]. The distribution
closely ($R^2 = 0.93$) follows a power-law decay.
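The text does not state how the power-law fit was obtained. One common approach, shown here purely as an assumed illustration, is an ordinary least-squares fit on log-log axes, which yields an exponent, a prefactor and a coefficient of determination. The function name `fit_power_law` is hypothetical, and the inputs must be strictly positive.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**k by least squares on (log x, log y).

    Returns (k, a, r_squared). All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    k = sxy / sxx                         # slope on log-log axes = exponent
    a = math.exp(mean_y - k * mean_x)     # intercept gives the prefactor
    r_squared = sxy ** 2 / (sxx * syy)    # squared Pearson correlation of the logs
    return k, a, r_squared
```

Applied to histogram bin centres and frequencies, a negative exponent with a coefficient of determination close to one would indicate the kind of decay described above.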

In both of these cases - molecules and protein complexes - we assume geometrically
unconstrained faces for the building blocks; in other words, we use $F_i = \prod_j k_j$. While
the chemical bonds of atoms and the interfaces of proteins are in fact usually constrained,
this information is not part of the structural formula of the molecule or the contact graph of
the protein complex. If this additional level of resolution is required, a more realistic
self-assembly model can be constructed, based on the exact three-dimensional characteristics of
the atoms or proteins, and using the $F(c_i, f_i)$ term specified above.

FIG. 6: (Colour online) Histogram of protein quaternary structure assembly complexity with
frequency of occurrence in the 3DComplex database. Insets illustrate two pairs of equally sized
structures with high and low complexity values. 1geh, 1i3q, 1q2v, and 1ohh are the PDB
identifiers of the complexes. The plot has an $R^2 = 0.93$ correlation with a power-law decay.
Note that in this case we do not distinguish between different types of subunit.

FIG. 7: (Colour online) The position of the 15733 protein complexes from [25] in the space of
$b$ (number of building block types) and $z$ (size of the complex). Many protein complexes are
highly modular, and this is true across a wide range of sizes. In this plot complexes of equal
modularity $m = z/b$ lie on a diagonal line with positive gradient. The lines are shown for
$m = 1$, 2, and 10 ($b/z = 1$, 0.5, and 0.1). The sizes of the circles show how many complexes
lie at a given position $(z, b)$. The insets show two examples (with PDB identifiers 1kyo and
1b5s), with low and high modularities respectively.



MODULARITY 



The self-assembly perspective provides an intuitive definition of the modularity of a
structure: If part of the structure appears several times, it still only needs to be encoded
once. This is why modularity and symmetry (being a special case of modularity) lead to more
efficient self-assembly kits and a lower value of the complexity measure $K$. Formally we can
define the modularity $m$ of a structure of size $z$ as the average number of times one of the
$b$ different building block types in the minimum assembly kit is used in the structure, which
is simply:

$$m = \frac{z}{b}$$



We can furthermore define a module formally as a connected set of building blocks which appears
more than once in a given structure. Note that modules can overlap: a subset of a module could
form another module, appearing a different number of times than the whole module. The molecule
in Figure 4 illustrates such a case.

The majority of protein complexes in the 3DComplex database show high modularity values
(Figure 7), with a common trend observable along the $b/z = 0.5$ line, indicating that many
protein complexes consist of structures involving two copies of all constituent subunits.
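Given the building block type assigned to each unit of a structure, the modularity is the structure size divided by the number of distinct types. The short sketch below assumes the types are supplied as a plain list with one entry per unit; the function name and the example labels are hypothetical.

```python
from collections import Counter


def modularity(block_types):
    """Average number of times each building block type is used: m = z / b."""
    usage = Counter(block_types)          # type -> number of occurrences
    z, b = len(block_types), len(usage)
    return z / b
```

A structure in which every type occurs exactly twice gives `modularity(["a", "a", "b", "b"]) == 2.0`, which corresponds to the $b/z = 0.5$ line, while a structure built from a single type has a modularity equal to its size.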

To further illustrate how the complexity $K$ and the modularity $m$ measure the physical
complexity of protein complexes, we consider two of the outliers in the complexity and
modularity histograms, the high-complexity 1ohh (Figure 6) and high-modularity 1b5s (Figure 7).
1ohh consists of two copies of bovine F1-ATPase (itself a protein complex) in complex with its
regulatory protein IF1 [39]. The regulatory protein binds simultaneously to both copies of the
main complex, but slightly asymmetrically, leading to asymmetric interactions being recorded in
the 3DComplex database. This asymmetry results in extra information being required to describe
the combined quaternary structure, and in the observed high complexity value. 1b5s is a
multienzyme complex consisting of multiple copies of dihydrolipoyl acetyltransferase (E2p)
[40]. The E2p protein has the potential to occupy quasi-equivalent positions, as seen in virus
structures [41], and is also observed to form cubic complexes. The highly modular, dodecahedral
structure exhibited in 1b5s is an efficient way of grouping many copies of an active protein in
a geometry that facilitates enzymatic activity: the large windows in the structure allow
passage of the substrate and product between the inner cavity and the substrate. The structure
of the protein subunits allows this structure to be realised with just one building block type,
resulting in high modularity.






JOINT, CONDITIONAL, AND MUTUAL 
COMPLEXITY 

If we have two structures $A$ and $B$ with minimum assembly kits $\tilde{S}_A$ and
$\tilde{S}_B$, then the joint minimum assembly kit $\tilde{S}_{A,B}$ is the minimum kit which
can assemble both structures if an appropriate subset of building blocks is chosen. The amount
of information required to describe this kit is the joint complexity $K(A, B)$ of $A$ and $B$.
This definition can easily be generalized to more than two structures.

Let us define $S'_A$ as the subset of $\tilde{S}_{A,B}$ which forms structure $A$, and $S'_B$
as the subset of $\tilde{S}_{A,B}$ which forms structure $B$ (note that e.g. $\tilde{S}_A$ is
not necessarily equal to $S'_A$ due to the colour minimization), so that
$\tilde{S}_{A,B} = S'_A \cup S'_B$. Furthermore, let us define the conditional minimum assembly
kit $\tilde{S}_{A|B}$ as the set of building blocks we need in addition to $S'_B$ in order to
form structure $A$. Then we can write:

$$\tilde{S}_{A|B} = \tilde{S}_{A,B} \setminus S'_B$$

where $\setminus$ denotes the set theoretic difference operation. The definition of
$\tilde{S}_{B|A}$ follows accordingly. Hence we can also define a conditional complexity
$K(A|B)$, which is the amount of information needed to describe the building blocks in
$\tilde{S}_{A|B}$. Because the way we describe the assembly kit is additive in the number of
building blocks, we can write

$$K(A|B) = K(A,B) - K'(B)$$

since $K'(B)$ is the information required to describe the building blocks in $S'_B$. The
relationship between $K(B)$ and $K'(B)$ is given by

$$K'(B) = K(B) + \sum_{i} c_i \log_2 \frac{c_{A,B}}{c_B}$$

where $c_{A,B}$ is the total number of colours in $\tilde{S}_{A,B}$ and $c_B$ is the total
number of colours in $\tilde{S}_B$. Because of the minimization of colours,
$c_{A,B} = \max(c_A, c_B)$. Hence, if $c_B > c_A$, then $K'(B) = K(B)$.

Similarly, we can define a mutual minimum assembly kit $\tilde{S}_{A:B}$, which corresponds to
the intersection

$$\tilde{S}_{A:B} = S'_A \cap S'_B = S'_A \setminus \tilde{S}_{A|B} = S'_B \setminus \tilde{S}_{B|A}$$

From this follows the mutual complexity

$$K(A:B) = K'(A) - K(A|B) = K'(B) - K(B|A) = K'(A) + K'(B) - K(A,B)$$
In order to account for the relative sizes of the structures we compare using these measures,
we can define relative versions of the above quantities. These are the relative conditional
complexity:

$$K^{(r)}(A|B) = \frac{K(A|B)}{K'(B)}$$

and the relative mutual complexity

$$K^{(r)}(A:B) = \frac{K(A:B)}{K(A,B)}$$

Note that the latter measure resembles the Jaccard index [?]. For an illustration of joint,
mutual and conditional complexity, see Figure 8.

FIG. 8: POLYOMINOES (left): The two polyominoes share many building block types, with the only
two unique ones being blocks 5 and 6 (marked in grey). Hence, the joint set is
$\tilde{S}_{A,B} = \{1,2,3,4,5,6\}$, the mutual set is $\tilde{S}_{A:B} = \{1,2,3,4\}$ and the
conditional sets are $\tilde{S}_{A|B} = \{5\}$ and $\tilde{S}_{B|A} = \{6\}$. Building block 5
contributes $K(A|B) = 2\log_2 9 + 2 = 8.3$ bits to the complexity $K'(A)$ of the $A$ shape,
while block 6 contributes $K(B|A) = 4\log_2 9 = 12.7$ bits to $K'(B)$. It follows therefore
that the joint complexity is $K(A,B) = 67.4$ bits and the mutual complexity is
$K(A:B) = 46.4$ bits, compared to the standalone values of $K(A) = K'(A) = 54.7$ bits and
$K(B) = K'(B) = 59.1$ bits (see Figure 3). AMINO ACIDS (right): The two amino acid molecules
asparagine (top, C) and glutamine (bottom, D) share the amino (NH2) and carboxyl (CO2H) groups
common to all amino acids, as well as the carboxamide group (CONH2). In a self-assembly
framework these two structures have complexities of $K(\mathrm{Asn}) = 74.3$ bits and
$K(\mathrm{Gln}) = 91$ bits. While $K'(\mathrm{Gln}) = K(\mathrm{Gln})$, we have
$K'(\mathrm{Asn}) = 78.0$ bits. Because the two molecules share three groups, their joint
complexity is not much larger than their individual complexities, at
$K(\mathrm{Asn}, \mathrm{Gln}) = 104.0$ bits, and their mutual complexity is not much smaller,
at $K(\mathrm{Asn} : \mathrm{Gln}) = 65$ bits, than the complexities of the individual
molecules. Their conditional complexities are correspondingly low, at
$K(\mathrm{Asn}|\mathrm{Gln}) = 13$ bits and $K(\mathrm{Gln}|\mathrm{Asn}) = 26$ bits. The
conditional complexities give the amount of information required to describe the building
blocks (atoms) which are unique (in their self-assembly role) to the given amino acid. These
atoms are marked with grey circles.
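The resemblance to the Jaccard index can be made explicit by comparing the two ratios side by side. In this sketch the relative mutual complexity is taken to be the mutual complexity divided by the joint complexity, and the Jaccard index is the usual ratio of intersection size to union size of two sets; the function names, and the use of bare building block labels as set elements, are assumptions for illustration.

```python
def relative_mutual_complexity(K_mutual, K_joint):
    """Mutual complexity as a fraction of the joint complexity."""
    return K_mutual / K_joint


def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)
```

For the polyomino example with block sets {1, 2, 3, 4, 5} and {1, 2, 3, 4, 6}, the Jaccard index counts blocks (4 of 6), whereas the relative mutual complexity weights each shared block by the information needed to describe it (46.4 of 67.4 bits).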
DISCUSSION 

Steric effects - For structures which contain loop structures formed by repeating units, it is
possible to exploit steric effects in order to reduce the size of the assembly kit below the
minimum size found by our algorithm (which explicitly excludes such effects in its definition).
An example of a steric effect would be a polyomino which is self-limiting in a deterministic
way, purely because of the geometric constraints of the building blocks. As long as each
distinct type of loop structure is formed by building blocks of a distinct species (or set of
species), the amount of information required to describe this structure can be taken to be the
same as that required to describe an infinite chain consisting of the same elements. A simple
example is given in Figure 9. The crucial assumption which has to hold for this simplification
to work is that the geometry of the loop is specified by the species (and, by extension, the
geometry) of the building block. For proteins as building blocks of protein complexes, this is
a very reasonable assumption. In the case of molecules it would furthermore be possible to
simplify the self-assembly kit by introducing building blocks representing common small loop
structures, such as carbon rings.

Multiple nuclei - In principle one could consider be- 
ginning the self-assembly with multiple nuclei in place. 
Multiple nuclei may, through steric hindrance or mod- 
ular repetition, be used to achieve certain structures in 
a more efficient way, using fewer building blocks than a
single nucleus would require. This reduction in complexity
may however be countered in practical applications
by the difficulty of achieving the required precise relative
displacements of nucleus particles. It is for these reasons
that we have concentrated on a single nucleus model, as the
positioning of multiple nuclei makes it much more difficult
to construct a general measure of complexity.

Within the single nucleus category, we further distinguish between structures with a specified
nucleus block and those with general nucleus blocks. The former case encompasses those assembly
kits which are guaranteed to produce a given output structure if and only if a specified block
is used as the nucleus (in other words, this block is placed on the substrate before other
blocks are introduced to the system). General-nucleus assembly kits by contrast will form the
same output structure regardless of which block is placed first. See Figure 10 for an
illustration of how specifying a nucleus can reduce the complexity of an assembly kit.

Which of these classes to employ in a study depends 
on the motivating context of the self-assembling system 
under consideration. If modelling assembly in a diffusion- 
dominated environment, for example, the order in which 
interacting particles meet cannot be specified, so the 
general-nucleus model is more appropriate. In a controlled 
environment where a nucleus can be placed to initiate 
assembly, the specified-nucleus model is applicable. 
The two cases correspond to different 'languages' being 
used to measure complexity, and so care must be taken 
in comparative studies to compare only numerical 
complexity values from within one class. 

FIG. 9: A simple example of a steric effect. The two blocks 
1 and 2 have colours A and B on their interfaces. These 
colours attract each other. All other faces are neutral. Certain 
arrangements of colours will lead to self-delimiting structures 
purely because of the geometry of the building blocks. The 
complexity of such structures can be taken to be the same 
as that of an infinite chain consisting of the same sequence of 
blocks, but only if each loop structure inside a bigger structure 
has a distinct (set of) species of building blocks. 

FIG. 10: Illustration of nuclei placement. (Top:) If we specify 
either of the two starred blocks as nuclei, deterministic bonding will 
result. However, if any other block is used as the nucleus, bonding 
will be non-deterministic, as both the {1, 0, 0, 4} and {1, 0, 5, 0} 
blocks can join the open 2 edges that will form. This self-assembly 
kit has a complexity of K = 42.4 bits. (Bottom:) A general nucleus 
system to produce the same structure, illustrating the required 
increase in complexity (K = 98.1 bits). 
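
The competition between blocks described in the caption of Figure 10 can be checked mechanically. The following is a minimal sketch and not the algorithm used in this work. It assumes that a square block is written as a tuple of four edge colours, that colour 0 is neutral, and that colours bind in pairs (1,2), (3,4), (5,6); this pairing is an assumption inferred from the caption's statement that blocks carrying a 1 can join open 2 edges. The function names and the third block in the example kit are hypothetical.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, int, int, int]  # edge colours of one square block


def partner(colour: int) -> int:
    """Assumed pairing: 1<->2, 3<->4, 5<->6. Colour 0 is neutral."""
    if colour == 0:
        return 0
    return colour + 1 if colour % 2 == 1 else colour - 1


def binders(kit: List[Block], open_colour: int) -> List[Block]:
    """Blocks in the kit carrying the colour that binds to an open edge."""
    wanted = partner(open_colour)
    if wanted == 0:
        return []
    return [block for block in kit if wanted in block]


def ambiguous_colours(kit: List[Block]) -> Dict[int, List[Block]]:
    """Open-edge colours to which more than one block species could bind."""
    colours = {c for block in kit for c in block if c != 0}
    result = {}
    for colour in sorted(colours):
        candidates = binders(kit, colour)
        if len(candidates) > 1:
            result[colour] = candidates
    return result


# The two blocks named in the caption, plus a hypothetical block exposing 2 edges.
kit = [(1, 0, 0, 4), (1, 0, 5, 0), (2, 2, 3, 6)]
print(ambiguous_colours(kit))  # {2: [(1, 0, 0, 4), (1, 0, 5, 0)]}
```

The check only flags potential competition for an open edge. Whether an edge of that colour is ever exposed while both candidates can still attach depends on the nucleus and on the order of growth, which this sketch does not simulate; this is why the same kit can be deterministic for one choice of nucleus and non-deterministic for another.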

Kolmogorov complexity - Our approach to measuring 
physical complexity is motivated by the concept of 
Kolmogorov complexity. It is however important to note that 
while Kolmogorov complexity itself is uncomputable due 
to the Halting problem [sl, our minimum is not. This is 
because the runtime of a finite computer program with 
finite output can be infinite, while the assembly time of 
a finite shape is always finite [4]. It is possible to de- 
fine the actual Kolmogorov complexity of a shape [5], 
but this is uncomputable. Our computable complexity 
measure K(A) forms a bound on this unattainable 
quantity, and is dependent on the way in which we encode 
the description of the assembly kit. It therefore is useful 
for the analysis, classification and comparison of physical 
structures, as long as we use a consistent encoding. 



CONCLUSION 

We present a general approach for measuring the phys- 
ical complexity of any connected structure, using the lan- 
guage of self-assembly. This approach is capable of de- 
tecting symmetry and modularity in a given structure, 
because these features significantly decrease the size of 
the required self-assembly instruction set. It therefore 
provides a powerful tool for automated classification and 
categorization of physical structures. In addition, the 
connection between self-assembly and complexity is an 
argument for the ubiquity of modular and symmetric 
features in biological systems: since many such systems 
self-assemble, evolving sets of self-assembly instructions are 
likely to yield symmetric and modular structures, as the 
instructions for these are more efficient to evolve. 
For free necklaces, which represent building blocks with 
no fixed chirality, there are $M_c = (c^4 + 2c^3 + 3c^2 + 2c)/8$ 
necklaces [3]. In general we will assume fixed chirality. 
Heterogeneous interfaces are double-counted as, unlike 
homogeneous interfaces, they require two colours. 
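
The free-necklace count can be verified by brute force. The sketch below is an illustration under stated assumptions, not part of the original analysis: it treats a block as a square with four coloured edges, enumerates all colourings with c colours, and counts them up to rotation and reflection. The function names are hypothetical. The fixed-chirality formula (c^4 + c^2 + 2c)/4 used for comparison is the standard Burnside count for rotations only and is not quoted from the text above.

```python
from itertools import product


def count_blocks(c: int, allow_reflection: bool) -> int:
    """Count distinct 4-edge colourings with c colours, up to rotation
    and, if allow_reflection is True, also up to mirror image."""
    seen = set()
    for colouring in product(range(c), repeat=4):
        variants = []
        for k in range(4):
            rotated = colouring[k:] + colouring[:k]
            variants.append(rotated)
            if allow_reflection:
                # Reversing every rotation yields all four reflections.
                variants.append(rotated[::-1])
        seen.add(min(variants))  # canonical representative of the orbit
    return len(seen)


def free_formula(c: int) -> int:
    """Free necklaces (no fixed chirality), as given in the text."""
    return (c**4 + 2 * c**3 + 3 * c**2 + 2 * c) // 8


def fixed_formula(c: int) -> int:
    """Fixed-chirality necklaces (rotations only); standard Burnside count."""
    return (c**4 + c**2 + 2 * c) // 4


for c in range(1, 6):
    assert count_blocks(c, allow_reflection=True) == free_formula(c)
    assert count_blocks(c, allow_reflection=False) == fixed_formula(c)

print(free_formula(3), fixed_formula(3))  # 21 24
```

For three colours the two counts differ (21 free versus 24 with fixed chirality) because some blocks are distinct from their mirror images only when chirality is fixed.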



